training signal
- Media > Music (0.46)
- Leisure & Entertainment (0.46)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- Europe > Germany (0.04)
- Asia > Singapore (0.04)
g-DPO: Scalable Preference Optimization for Protein Language Models
Constance Ferragu, Jonathan D. Ziegler, Nicolas Deutschmann, Arthur Lindoulsi, Eli Bixby, Cradle ML Team
Direct Preference Optimization (DPO) is an effective approach for aligning protein language models with experimental design goals. However, DPO faces a scalability bottleneck: the number of possible training pairs grows quadratically with the number of labeled sequences, leading to prohibitive training times even for modestly sized datasets. We introduce g-DPO, a framework that (i) uses sequence-space clustering to prune redundant pairs while preserving training signal, and (ii) amortizes likelihood computations with group-based approximations. Across three protein engineering tasks, g-DPO maintains in silico and in vitro performance that is statistically indistinguishable from standard DPO, while converging 1.7x to 5.4x faster, with speedups that scale with dataset size and the structure of the underlying mutational landscape.
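The pair-pruning idea in (i) can be sketched as follows. This is a minimal illustration, not the authors' code: the prefix-based grouping is a toy stand-in for the paper's sequence-space clustering, and the names `cluster_by_prefix` and `pruned_pairs` are hypothetical.

```python
from itertools import combinations

def cluster_by_prefix(seqs, k=3):
    """Toy sequence-space clustering: group sequences sharing a k-token
    prefix. Any real sequence clustering could be substituted here."""
    clusters = {}
    for s in seqs:
        clusters.setdefault(s[:k], []).append(s)
    return list(clusters.values())

def pruned_pairs(scored_seqs, k=3):
    """Build preference pairs only across cluster representatives,
    pruning the near-redundant within-cluster pairs that would otherwise
    make the pair count grow quadratically."""
    score = dict(scored_seqs)
    clusters = cluster_by_prefix([s for s, _ in scored_seqs], k)
    # One representative per cluster: its best-scoring member.
    reps = [max(c, key=lambda s: score[s]) for c in clusters]
    pairs = []
    for a, b in combinations(reps, 2):
        if score[a] != score[b]:
            winner, loser = (a, b) if score[a] > score[b] else (b, a)
            pairs.append((winner, loser))
    return pairs
```

With five sequences, the full pair set has 10 candidates; pruning to cluster representatives leaves only the cross-cluster pairs.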
- Research Report > New Finding (0.68)
- Research Report > Experimental Study (0.46)
Leveraging Reinforcement Learning, Genetic Algorithms and Transformers for background determination in particle physics
Guillermo Hijano Mendizabal, Davide Lancierini, Alex Marshall, Andrea Mauri, Patrick Haworth Owen, Mitesh Patel, Konstantinos Petridis, Shah Rukh Qasim, Nicola Serra, William Sutcliffe, Hanae Tilquin
Experimental studies of beauty hadron decays face significant challenges due to a wide range of backgrounds arising from the numerous possible decay channels with similar final states. For a particular signal decay, the process for ascertaining the most relevant background processes necessitates a detailed analysis of final state particles, potential misidentifications, and kinematic overlaps, which, due to computational limitations, is restricted to the simulation of only the most relevant backgrounds. Moreover, this process typically relies on the physicist's intuition and expertise, as no systematic method exists. This paper has two primary goals. First, from a particle physics perspective, we present a novel approach that utilises Reinforcement Learning (RL) to overcome the aforementioned challenges by systematically determining the critical backgrounds affecting beauty hadron decay measurements. While beauty hadron physics serves as the case study in this work, the proposed strategy is broadly adaptable to other types of particle physics measurements. Second, from a Machine Learning perspective, we introduce a novel algorithm which exploits the synergy between RL and Genetic Algorithms (GAs) for environments with highly sparse rewards and a large trajectory space. This strategy leverages GAs to efficiently explore the trajectory space and identify successful trajectories, which are used to guide the RL agent's training. Our method also incorporates a transformer architecture for the RL agent to handle token sequences representing decays.
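The GA side of the GA-RL synergy can be illustrated with a minimal sketch on a toy fitness landscape (not the beauty-decay environment; `genetic_explore`, its operators, and its parameters are assumptions for illustration). The GA explores the trajectory space and returns every successful trajectory it finds; in the paper's setting, such trajectories would then guide the RL agent's training.

```python
import random

def genetic_explore(reward_fn, vocab, length, threshold,
                    pop_size=30, gens=40, seed=0):
    """Toy GA for a large, sparse-reward trajectory space: evolve token
    sequences and collect every trajectory whose reward clears `threshold`."""
    rng = random.Random(seed)
    pop = [[rng.choice(vocab) for _ in range(length)] for _ in range(pop_size)]
    successes = []
    for _ in range(gens):
        scored = sorted(pop, key=reward_fn, reverse=True)
        for t in scored:                          # record new successes
            if reward_fn(t) >= threshold and t not in successes:
                successes.append(t)
        elite = scored[: max(2, pop_size // 5)]   # elitist selection
        pop = [list(t) for t in elite]
        while len(pop) < pop_size:                # crossover + point mutation
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, length)
            child = a[:cut] + b[cut:]
            if rng.random() < 0.3:
                child[rng.randrange(length)] = rng.choice(vocab)
            pop.append(child)
    return successes
```

A dense surrogate reward (here, positional matches against a target) is used for the demo; the paper's rewards are far sparser, which is exactly why the GA's exploratory role matters.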
- Europe > Switzerland > Zürich > Zürich (0.14)
- North America > United States > Maryland > Baltimore (0.04)
- North America > Canada (0.04)
- (2 more...)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.88)
- Information Technology > Artificial Intelligence > Machine Learning > Evolutionary Systems (0.88)
Explore Data Left Behind in Reinforcement Learning for Reasoning Language Models
Chenxi Liu, Junjie Liang, Yuqi Jia, Bochuan Cao, Yang Bai, Heng Huang, Xun Chen
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective approach for improving the reasoning abilities of large language models (LLMs). The Group Relative Policy Optimization (GRPO) family has demonstrated strong performance in training LLMs with RLVR. However, as models train longer and scale larger, more training prompts become residual prompts: prompts with zero-variance rewards that provide no training signal. Consequently, fewer prompts contribute to training, reducing diversity and hindering effectiveness. To fully exploit these residual prompts, we propose the Explore Residual Prompts in Policy Optimization (ERPO) framework, which encourages exploration on residual prompts and reactivates their training signals. ERPO maintains a history tracker for each prompt and adaptively increases the sampling temperature for residual prompts that previously produced all correct responses. This encourages the model to generate more diverse reasoning traces, introducing incorrect responses that revive training signals. Empirical results on the Qwen2.5 series demonstrate that ERPO consistently surpasses strong baselines across multiple mathematical reasoning benchmarks.
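The history tracker and adaptive temperature can be sketched as below. This is an assumed interface, not the authors' implementation: the class name, the linear temperature schedule, and the reward-of-1.0-means-correct convention are all illustrative choices.

```python
class ResidualPromptTracker:
    """Per-prompt history tracking in the spirit of ERPO: prompts whose
    sampled response groups keep coming back all-correct (zero-variance
    reward, so GRPO-style advantages vanish) get a higher sampling
    temperature to encourage more diverse reasoning traces."""

    def __init__(self, base_temp=1.0, step=0.1, max_temp=1.5):
        self.base_temp = base_temp
        self.step = step
        self.max_temp = max_temp
        self.streak = {}  # prompt -> consecutive all-correct rounds

    def temperature(self, prompt):
        # Temperature grows linearly with the all-correct streak, capped.
        return min(self.base_temp + self.streak.get(prompt, 0) * self.step,
                   self.max_temp)

    def update(self, prompt, rewards):
        # Residual here: every response correct (assumed reward 1.0).
        if len(set(rewards)) == 1 and rewards[0] == 1.0:
            self.streak[prompt] = self.streak.get(prompt, 0) + 1
        else:
            self.streak[prompt] = 0
```

Once a residual prompt yields a mixed-reward group again, its streak resets and sampling returns to the base temperature.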
- North America > United States > Pennsylvania (0.04)
- North America > United States > Maryland > Prince George's County > College Park (0.04)
Learning Beyond Experience: Generalizing to Unseen State Space with Reservoir Computing
Declan A. Norton, Yuanzhao Zhang, Michelle Girvan
Machine learning techniques offer an effective approach to modeling dynamical systems solely from observed data. However, without explicit structural priors -- built-in assumptions about the underlying dynamics -- these techniques typically struggle to generalize to aspects of the dynamics that are poorly represented in the training data. Here, we demonstrate that reservoir computing -- a simple, efficient, and versatile machine learning framework often used for data-driven modeling of dynamical systems -- can generalize to unexplored regions of state space without explicit structural priors. First, we describe a multiple-trajectory training scheme for reservoir computers (RCs) that supports training across a collection of disjoint time series, enabling effective use of available training data. Then, applying this training scheme to multistable dynamical systems, we show that RCs trained on trajectories from a single basin of attraction can achieve out-of-domain generalization by capturing system behavior in entirely unobserved basins.
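The multiple-trajectory scheme can be sketched with a standard echo-state-style reservoir (numpy only). The specifics here (reservoir size, spectral radius, ridge readout, `train_rc_multi` and its parameters) are generic choices for illustration, not the paper's exact setup; the key point is that the reservoir state is reset between disjoint trajectories while a single shared readout is fit on the pooled data.

```python
import numpy as np

def train_rc_multi(trajectories, n_res=100, rho=0.9, ridge=1e-6, seed=0):
    """Drive the reservoir through each disjoint time series separately,
    resetting its state between trajectories, then fit one shared linear
    readout on the pooled (state, next-observation) pairs."""
    rng = np.random.default_rng(seed)
    dim = trajectories[0].shape[1]
    W_in = rng.uniform(-0.5, 0.5, (n_res, dim))
    W = rng.normal(0.0, 1.0, (n_res, n_res))
    W *= rho / np.max(np.abs(np.linalg.eigvals(W)))  # set spectral radius
    states, targets = [], []
    for traj in trajectories:
        r = np.zeros(n_res)                # fresh state for each trajectory
        for t in range(len(traj) - 1):
            r = np.tanh(W @ r + W_in @ traj[t])
            states.append(r)
            targets.append(traj[t + 1])
    S, Y = np.array(states), np.array(targets)
    # Ridge-regression readout shared across all trajectories.
    W_out = np.linalg.solve(S.T @ S + ridge * np.eye(n_res), S.T @ Y)
    return W_in, W, W_out

def one_step_preds(W_in, W, W_out, traj):
    """One-step-ahead predictions along a given trajectory."""
    r = np.zeros(W.shape[0])
    preds = []
    for t in range(len(traj) - 1):
        r = np.tanh(W @ r + W_in @ traj[t])
        preds.append(r @ W_out)
    return np.array(preds)
```

Per-trajectory state resets are what make disjoint segments usable: the readout never sees states contaminated by a spurious transition between unrelated series.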
- North America > United States > Maryland > Prince George's County > College Park (0.14)
- North America > United States > New Mexico > Santa Fe County > Santa Fe (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > Netherlands > South Holland > Dordrecht (0.04)
- Energy (0.93)
- Government > Regional Government > North America Government > United States Government (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.67)
Masked Generative Adversarial Networks are Data-Efficient Generation Learners: Supplemental Materials
With limited training data, the discriminator tends to discriminate via meaningless shortcuts, focusing on easy-to-discriminate image locations and spectra instead of a holistic understanding of images; a randomly initialized model, for instance, produces very even spatial attentions. MaskedGAN can be modeled as an instance of the Two Time-Scale Update Rule: the losses in Eq. 2 are first re-written, Eqs. 7 and 18 are simplified using the sum rule of integration, the gradients of the generator loss functions are derived from Eqs. 20 and 23, and comparing Eqs. 25 and 28 then establishes the convergence of the proposed MaskedGAN. [The intermediate equations (Eqs. 2-28) are not reproduced in this extract.] In MaskedGAN, the mask ratio in generator training controls the learning pace of the generator by masking out a portion of the training signals back-propagated from the discriminator.
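The mask-ratio mechanism can be shown in isolation with a numpy toy (not the paper's implementation; `mask_training_signal` is a hypothetical helper): zeroing a random fraction of the per-location signal flowing back to the generator slows its learning pace as the ratio grows.

```python
import numpy as np

def mask_training_signal(grad_map, mask_ratio, seed=0):
    """Toy illustration of the mask-ratio idea: randomly zero out a
    `mask_ratio` fraction of the spatial training signal back-propagated
    from the discriminator to the generator."""
    rng = np.random.default_rng(seed)
    keep = rng.random(grad_map.shape) >= mask_ratio  # Bernoulli keep-mask
    return grad_map * keep
```

A ratio of 0 leaves the signal untouched; higher ratios suppress a larger share of locations per update.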
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- Europe > Germany (0.04)
- Asia > Singapore (0.04)
Improving Sampling Efficiency in RLVR through Adaptive Rollout and Response Reuse
Yuheng Zhang, Wenlin Yao, Changlong Yu, Yao Liu, Qingyu Yin, Bing Yin, Hyokun Yun, Lihong Li
Large language models (LLMs) have achieved impressive reasoning performance, with reinforcement learning with verifiable rewards (RLVR) emerging as a standard paradigm for post-training. A representative algorithm, group relative policy optimization (GRPO) (Shao et al., 2024), computes advantages by normalizing outcome rewards within response groups, but suffers from a vanishing advantage issue when all responses in a group receive identical rewards. To address this issue, we propose Adaptive Rollout and Response Reuse Policy Optimization (AR3PO), a sampling-efficient RLVR algorithm that introduces two novel techniques: adaptive rollout, which dynamically allocates more responses to difficult prompts while saving computation on easier ones, and response reuse, which leverages previously generated correct responses to provide useful training signals. We compare AR3PO with strong RLVR baselines on multiple representative benchmarks using two different families of base models. Across the 7B and 8B models, AR3PO consistently outperforms GRPO and matches or surpasses DAPO (Yu et al., 2025), reducing rollout cost by up to 4.2x. On the larger 32B model, AR3PO achieves comparable performance to DAPO at similar training steps while maintaining substantially lower rollout cost.
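The two techniques can be sketched together. These are assumed interfaces for illustration, not AR3PO's actual rules: `allocate_rollouts` uses a simple difficulty-proportional split with clipping, and `ResponseBuffer` caches correct responses for reuse.

```python
class ResponseBuffer:
    """Sketch of response reuse: cache previously generated correct
    responses per prompt so they can supply training signal later."""

    def __init__(self):
        self.cache = {}

    def add(self, prompt, response, reward):
        if reward > 0:  # keep only correct responses
            self.cache.setdefault(prompt, []).append(response)

    def reuse(self, prompt):
        return list(self.cache.get(prompt, []))


def allocate_rollouts(difficulty, budget, n_min=2, n_max=16):
    """Sketch of adaptive rollout: split a fixed rollout budget across
    prompts in proportion to estimated difficulty (e.g. historical
    failure rate), clipped to [n_min, n_max] responses per prompt."""
    total = sum(difficulty.values()) or 1.0
    return {p: int(max(n_min, min(n_max, round(budget * d / total))))
            for p, d in difficulty.items()}
```

Hard prompts thus receive more rollouts per step, while easy prompts fall back to the floor `n_min` and, when they stall, can draw on cached correct responses instead of fresh sampling.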
- North America > United States (0.14)
- Asia > Middle East > Israel (0.05)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Speech (0.95)
- Information Technology > Artificial Intelligence > Vision (0.94)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
- Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.14)
- Asia > Middle East > Republic of Türkiye > Karaman Province > Karaman (0.04)